Note, before using knitr please install all missing packages from the code cell below into your environment. Also, I recommended knitting into HTML as it has been optimized for viewing in that format.
This data was procured from https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic and will be updated annually at the source and any time this .rmd file is knit.
From the source: “List of every shooting incident that occurred in NYC going back to 2006 through the end of the previous calendar year.
This is a breakdown of every shooting incident that occurred in NYC going back to 2006 through the end of the previous calendar year. This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning before being posted on the NYPD website. Each record represents a shooting incident in NYC and includes information about the event, the location and time of occurrence. In addition, information related to suspect and victim demographics is also included. This data can be used by the public to explore the nature of shooting/criminal activity. Please refer to the attached data footnotes for additional information about this dataset.”
For more information on the details of this dataset it is recommended to follow the link and access the footnotes pdf found on the landing page website.
We will first begin by loading in the R packages we intend to use.
Then, we will import the data using a URL directly from the source, this ensures we will capture updates to the data as they come in, whenever this is run again.
# Output all commands run and set a standard plot size
knitr::opts_chunk$set(echo = TRUE, fig.width = 10, fig.height = 6)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
import_url <- read.csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD")
Let’s take a look at the dimensions of this imported data.frame, as well as the variable types of each column, and output a summary.
dim(import_url)
## [1] 25596 19
str(import_url)
## 'data.frame': 25596 obs. of 19 variables:
## $ INCIDENT_KEY : int 236168668 231008085 230717903 237712309 224465521 228252164 226950018 237710987 224701998 225295736 ...
## $ OCCUR_DATE : chr "11/11/2021" "07/16/2021" "07/11/2021" "12/11/2021" ...
## $ OCCUR_TIME : chr "15:04:00" "22:05:00" "01:09:00" "13:42:00" ...
## $ BORO : chr "BROOKLYN" "BROOKLYN" "BROOKLYN" "BROOKLYN" ...
## $ PRECINCT : int 79 72 79 81 113 113 42 52 34 75 ...
## $ JURISDICTION_CODE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LOCATION_DESC : chr "" "" "" "" ...
## $ STATISTICAL_MURDER_FLAG: chr "false" "false" "false" "false" ...
## $ PERP_AGE_GROUP : chr "" "45-64" "<18" "" ...
## $ PERP_SEX : chr "" "M" "M" "" ...
## $ PERP_RACE : chr "" "ASIAN / PACIFIC ISLANDER" "BLACK" "" ...
## $ VIC_AGE_GROUP : chr "18-24" "25-44" "25-44" "25-44" ...
## $ VIC_SEX : chr "M" "M" "M" "M" ...
## $ VIC_RACE : chr "BLACK" "ASIAN / PACIFIC ISLANDER" "BLACK" "BLACK" ...
## $ X_COORD_CD : num 996313 981845 996546 1001139 1050710 ...
## $ Y_COORD_CD : num 187499 171118 187436 192775 184826 ...
## $ Latitude : num 40.7 40.6 40.7 40.7 40.7 ...
## $ Longitude : num -74 -74 -74 -73.9 -73.8 ...
## $ Lon_Lat : chr "POINT (-73.95650899099996 40.68131820000008)" "POINT (-74.00866668999998 40.63636384100005)" "POINT (-73.95566903799994 40.68114495900005)" "POINT (-73.939095905 40.69579171600003)" ...
summary(import_url)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME BORO
## Min. : 9953245 Length:25596 Length:25596 Length:25596
## 1st Qu.: 61593633 Class :character Class :character Class :character
## Median : 86437258 Mode :character Mode :character Mode :character
## Mean :112382648
## 3rd Qu.:166660833
## Max. :238490103
##
## PRECINCT JURISDICTION_CODE LOCATION_DESC STATISTICAL_MURDER_FLAG
## Min. : 1.00 Min. :0.0000 Length:25596 Length:25596
## 1st Qu.: 44.00 1st Qu.:0.0000 Class :character Class :character
## Median : 69.00 Median :0.0000 Mode :character Mode :character
## Mean : 65.87 Mean :0.3316
## 3rd Qu.: 81.00 3rd Qu.:0.0000
## Max. :123.00 Max. :2.0000
## NA's :2
## PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## Length:25596 Length:25596 Length:25596 Length:25596
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## VIC_SEX VIC_RACE X_COORD_CD Y_COORD_CD
## Length:25596 Length:25596 Min. : 914928 Min. :125757
## Class :character Class :character 1st Qu.:1000011 1st Qu.:182782
## Mode :character Mode :character Median :1007715 Median :194038
## Mean :1009455 Mean :207894
## 3rd Qu.:1016838 3rd Qu.:239429
## Max. :1066815 Max. :271128
##
## Latitude Longitude Lon_Lat
## Min. :40.51 Min. :-74.25 Length:25596
## 1st Qu.:40.67 1st Qu.:-73.94 Class :character
## Median :40.70 Median :-73.92 Mode :character
## Mean :40.74 Mean :-73.91
## 3rd Qu.:40.82 3rd Qu.:-73.88
## Max. :40.91 Max. :-73.70
##
Looks like we have 19 columns (features) and 25596
rows (data points).
First, let’s remove any features that we won’t be needing for our
analysis.
JURISDICTION_CODE is pretty broad for localizing
shooting incidents so we will end up using BORO instead
which will give more insight to our analysis.X_COORD_CD, Y_COORD_CD, and
Lon_Lat are all redundant as we will use
LONGITUDE and LATITUDE in their place.Also, let’s rename a few of these for more readability.
# Remove features
import_url <- select(import_url, -JURISDICTION_CODE, -X_COORD_CD, -Y_COORD_CD, -Lon_Lat)
# Rename features
import_url <- import_url %>%
rename(c('DATE' = 'OCCUR_DATE', 'TIME' = 'OCCUR_TIME','BOROUGH' = 'BORO',
'LOCATION' = 'LOCATION_DESC', 'MURDER_FLAG' = 'STATISTICAL_MURDER_FLAG',
'PERP_AGE' = 'PERP_AGE_GROUP', 'VICTIM_AGE' = 'VIC_AGE_GROUP', 'VICTIM_SEX' = 'VIC_SEX',
'VICTIM_RACE' = 'VIC_RACE', 'LATITUDE' = 'Latitude', 'LONGITUDE' = 'Longitude'))
head(import_url)
## INCIDENT_KEY DATE TIME BOROUGH PRECINCT LOCATION MURDER_FLAG
## 1 236168668 11/11/2021 15:04:00 BROOKLYN 79 false
## 2 231008085 07/16/2021 22:05:00 BROOKLYN 72 false
## 3 230717903 07/11/2021 01:09:00 BROOKLYN 79 false
## 4 237712309 12/11/2021 13:42:00 BROOKLYN 81 false
## 5 224465521 02/16/2021 20:00:00 QUEENS 113 false
## 6 228252164 05/15/2021 04:13:00 QUEENS 113 true
## PERP_AGE PERP_SEX PERP_RACE VICTIM_AGE VICTIM_SEX
## 1 18-24 M
## 2 45-64 M ASIAN / PACIFIC ISLANDER 25-44 M
## 3 <18 M BLACK 25-44 M
## 4 25-44 M
## 5 25-44 M
## 6 25-44 M
## VICTIM_RACE LATITUDE LONGITUDE
## 1 BLACK 40.68132 -73.95651
## 2 ASIAN / PACIFIC ISLANDER 40.63636 -74.00867
## 3 BLACK 40.68114 -73.95567
## 4 BLACK 40.69579 -73.93910
## 5 BLACK 40.67374 -73.76041
## 6 BLACK 40.70618 -73.75806
Next, we will check if there are any missing or duplicated data
points, focusing only on the INCIDENT_KEY feature for now.
This feature will be the most important for identifying any duplicate
entries as they should all be unique.
# Check for any NA or Null values
any(is.na(import_url$INCIDENT_KEY)) | any(is.null(import_url$INCIDENT_KEY))
## [1] FALSE
# Check for duplicates
length(unique(import_url$INCIDENT_KEY))
## [1] 20126
length(import_url$INCIDENT_KEY)
## [1] 25596
Subtracting the results here shows that there are 5470 duplicate data points! Let’s take a closer look to make sure these aren’t false positives.
# Query duplicates to see what they look like
head(filter(import_url, duplicated(import_url$INCIDENT_KEY)))
## INCIDENT_KEY DATE TIME BOROUGH PRECINCT LOCATION
## 1 229643172 06/16/2021 23:22:00 BRONX 52
## 2 236363733 11/16/2021 22:39:00 MANHATTAN 14
## 3 226542152 04/05/2021 22:10:00 BRONX 44 MULTI DWELL - PUBLIC HOUS
## 4 227647476 05/02/2021 18:18:00 MANHATTAN 23 MULTI DWELL - PUBLIC HOUS
## 5 232496781 08/19/2021 20:32:00 BROOKLYN 77
## 6 232390408 08/17/2021 22:20:00 BROOKLYN 73 GROCERY/BODEGA
## MURDER_FLAG PERP_AGE PERP_SEX PERP_RACE VICTIM_AGE VICTIM_SEX
## 1 false 18-24 M WHITE HISPANIC 25-44 F
## 2 false <18 M BLACK 25-44 M
## 3 false 45-64 M
## 4 false 18-24 M WHITE HISPANIC 25-44 M
## 5 false 45-64 M BLACK 45-64 F
## 6 false 18-24 M
## VICTIM_RACE LATITUDE LONGITUDE
## 1 BLACK 40.86414 -73.89131
## 2 BLACK 40.75165 -73.98434
## 3 BLACK 40.83750 -73.92785
## 4 BLACK 40.78694 -73.94357
## 5 BLACK 40.67036 -73.92680
## 6 BLACK 40.66835 -73.90652
# Check a few entries
arrange(filter(import_url, INCIDENT_KEY == 227647476 | INCIDENT_KEY == 232390408), INCIDENT_KEY)
## INCIDENT_KEY DATE TIME BOROUGH PRECINCT LOCATION
## 1 227647476 05/02/2021 18:18:00 MANHATTAN 23 MULTI DWELL - PUBLIC HOUS
## 2 227647476 05/02/2021 18:18:00 MANHATTAN 23 MULTI DWELL - PUBLIC HOUS
## 3 227647476 05/02/2021 18:18:00 MANHATTAN 23 MULTI DWELL - PUBLIC HOUS
## 4 232390408 08/17/2021 22:20:00 BROOKLYN 73 GROCERY/BODEGA
## 5 232390408 08/17/2021 22:20:00 BROOKLYN 73 GROCERY/BODEGA
## MURDER_FLAG PERP_AGE PERP_SEX PERP_RACE VICTIM_AGE VICTIM_SEX
## 1 false <18 M BLACK 25-44 M
## 2 false 18-24 M WHITE HISPANIC 25-44 M
## 3 false 18-24 M BLACK 25-44 M
## 4 false 25-44 M
## 5 false 18-24 M
## VICTIM_RACE LATITUDE LONGITUDE
## 1 BLACK 40.78694 -73.94357
## 2 BLACK 40.78694 -73.94357
## 3 BLACK 40.78694 -73.94357
## 4 BLACK 40.66835 -73.90652
## 5 BLACK 40.66835 -73.90652
# Yes those are definitely duplicates
# Remove duplicates
import_url <- filter(import_url, !duplicated(import_url$INCIDENT_KEY))
# Check work
length(duplicated(import_url$INCIDENT_KEY))
## [1] 20126
That should do it for the duplicated data points. Let’s continue our transformations.
For better analysis we should change the class type of a few of these features to make them easier to work with.
# Character to Date and Period
import_url <- import_url %>%
mutate(DATE = mdy(DATE)) %>%
mutate(TIME = hms(TIME))
# Character to Factors - changes all character columns to factor
import_url <- import_url %>%
mutate(across(where(is.character), as.factor))
str(import_url)
## 'data.frame': 20126 obs. of 15 variables:
## $ INCIDENT_KEY: int 236168668 231008085 230717903 237712309 224465521 228252164 226950018 237710987 224701998 225295736 ...
## $ DATE : Date, format: "2021-11-11" "2021-07-16" ...
## $ TIME :Formal class 'Period' [package "lubridate"] with 6 slots
## .. ..@ .Data : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ year : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ month : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ day : num 0 0 0 0 0 0 0 0 0 0 ...
## .. ..@ hour : num 15 22 1 13 20 4 21 19 0 6 ...
## .. ..@ minute: num 4 5 9 42 0 13 8 30 18 15 ...
## $ BOROUGH : Factor w/ 5 levels "BRONX","BROOKLYN",..: 2 2 2 2 4 4 1 1 3 2 ...
## $ PRECINCT : int 79 72 79 81 113 113 42 52 34 75 ...
## $ LOCATION : Factor w/ 40 levels "","ATM","BANK",..: 1 1 1 1 1 1 10 1 1 1 ...
## $ MURDER_FLAG : Factor w/ 2 levels "false","true": 1 1 1 1 1 2 2 1 1 2 ...
## $ PERP_AGE : Factor w/ 10 levels "","<18","1020",..: 1 7 2 1 1 1 1 1 1 6 ...
## $ PERP_SEX : Factor w/ 4 levels "","F","M","U": 1 3 3 1 1 1 1 1 1 3 ...
## $ PERP_RACE : Factor w/ 8 levels "","AMERICAN INDIAN/ALASKAN NATIVE",..: 1 3 4 1 1 1 1 1 1 5 ...
## $ VICTIM_AGE : Factor w/ 6 levels "<18","18-24",..: 2 3 3 3 3 3 2 3 3 3 ...
## $ VICTIM_SEX : Factor w/ 3 levels "F","M","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ VICTIM_RACE : Factor w/ 7 levels "AMERICAN INDIAN/ALASKAN NATIVE",..: 3 2 3 3 3 3 3 3 4 7 ...
## $ LATITUDE : num 40.7 40.6 40.7 40.7 40.7 ...
## $ LONGITUDE : num -74 -74 -74 -73.9 -73.8 ...
We will continue to look at the features and see if any of these blank entries will cause trouble during the analysis. Also, we’ll look to see if there are any duplicate categorical factors in the rest of the features.
# Create a table of each column to check factor levels
for (i in 1:length(import_url)){
ifelse(is.factor(import_url[ ,i]), print(table(import_url[ ,i, drop = FALSE])), next)
}
## BOROUGH
## BRONX BROOKLYN MANHATTAN QUEENS STATEN ISLAND
## 5611 8357 2536 3031 591
## LOCATION
## ATM BANK
## 11964 1 2
## BAR/NIGHT CLUB BEAUTY/NAIL SALON CANDY STORE
## 400 77 6
## CHAIN STORE CHECK CASH CLOTHING BOUTIQUE
## 5 1 11
## COMMERCIAL BLDG DEPT STORE DOCTOR/DENTIST
## 194 5 1
## DRUG STORE DRY CLEANER/LAUNDRY FACTORY/WAREHOUSE
## 7 24 5
## FAST FOOD GAS STATION GROCERY/BODEGA
## 77 50 468
## GYM/FITNESS FACILITY HOSPITAL HOTEL/MOTEL
## 3 37 25
## JEWELRY STORE LIQUOR STORE LOAN COMPANY
## 9 24 1
## MULTI DWELL - APT BUILD MULTI DWELL - PUBLIC HOUS NONE
## 2007 3664 140
## PHOTO/COPY STORE PVT HOUSE RESTAURANT/DINER
## 1 641 153
## SCHOOL SHOE STORE SMALL MERCHANT
## 1 4 18
## SOCIAL CLUB/POLICY LOCATI STORAGE FACILITY STORE UNCLASSIFIED
## 43 1 28
## SUPERMARKET TELECOMM. STORE VARIETY STORE
## 14 2 10
## VIDEO STORE
## 2
## MURDER_FLAG
## false true
## 16604 3522
## PERP_AGE
## <18 1020 18-24 224 25-44 45-64 65+ 940 UNKNOWN
## 8120 993 1 4086 1 3769 374 44 1 2737
## PERP_SEX
## F M U
## 8090 212 10469 1355
## PERP_RACE
## AMERICAN INDIAN/ALASKAN NATIVE
## 8090 2
## ASIAN / PACIFIC ISLANDER BLACK
## 88 7871
## BLACK HISPANIC UNKNOWN
## 803 1627
## WHITE WHITE HISPANIC
## 213 1432
## VICTIM_AGE
## <18 18-24 25-44 45-64 65+ UNKNOWN
## 1984 7585 9122 1273 125 37
## VICTIM_SEX
## F M U
## 1522 18598 6
## VICTIM_RACE
## AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER
## 8 260
## BLACK BLACK HISPANIC
## 14742 1854
## UNKNOWN WHITE
## 50 517
## WHITE HISPANIC
## 2695
Looks like there are quite a few blank entries and a few labeled as “UNKNOWN”.
LOCATION and PERP_AGE data
is unknown.BOROUGH, MURDER_FLAG,
PRECINCT, DATE, TIME,
LONGITUDE, and LATITUDE have no missing
entries.We will combine these by labeling all blanks as “UNKNOWN”. This data
likely comes from officers on the scene who may have: 1. missed some
information; 2. did not have a witness and/or did not catch the
offender; or 3. not have previously recorded this particular data but
now are due to process changes. It’s reasonable to leave these missing
data points in because the missing data also gives us information about
that incident. Also, removing these data points due to their empty
entries would be a mistake considering the data that is complete holds
more relevance. Removing it would be like throwing away the baby with
the bathwater (e.g. removing a data point that is missing the
LOCATION description, but isn’t missing the rest of the
information will only hurt our analysis. Especially considering
BOROUGH,LONGITUDE, andLATITUDE
aren’t missing).
Here we will combine and correct any missing entries with the methods discussed above.
for (i in 1:length(import_url)){
# IF the column is a factor AND contains empty space OR a 'U'
if(is.factor(import_url[ ,i]) && '' %in% import_url[ ,i] || 'U' %in% import_url[ ,i]){
# Add level named UNKNOWN
levels(import_url[, i]) <- c(levels(import_url[, i]), 'UNKNOWN')
# Replace missing values and Us with UNKNWON
import_url[, i][import_url[, i] == ''] <- as.factor('UNKNOWN')
import_url[, i][import_url[, i] == 'U'] <- as.factor('UNKNOWN')
# Remove unused levels
import_url[, i] <- droplevels(import_url[, i])
}else{
next
}
}
Lastly there are a few values in PERP_AGE that look like
data entry typos. See table above:(‘1020’, ‘224’, ‘940’) Here, we cannot
assume what was intended so we will change these age values to
‘UNKNOWN’.
# Before changes
table(import_url$PERP_AGE)
##
## <18 1020 18-24 224 25-44 45-64 65+ 940 UNKNOWN
## 993 1 4086 1 3769 374 44 1 10857
# Set values we want to keep as levels
age_levels <- c('<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN')
# Find all values NOT in age_levels (notice the !)
import_url$PERP_AGE[!import_url$PERP_AGE %in% age_levels] <- as.factor('UNKNOWN')
# Remove unused levels
import_url$PERP_AGE <- droplevels(import_url$PERP_AGE)
# After changes
table(import_url$PERP_AGE)
##
## <18 18-24 25-44 45-64 65+ UNKNOWN
## 993 4086 3769 374 44 10860
Let’s take a look at a summary of this data now that we’ve cleaned it up.
summary(import_url)
## INCIDENT_KEY DATE TIME
## Min. : 9953245 Min. :2006-01-01 Min. :0S
## 1st Qu.: 62525686 1st Qu.:2009-06-07 1st Qu.:3H 25M 0S
## Median : 87157256 Median :2012-10-09 Median :15H 3M 30S
## Mean :113047411 Mean :2013-07-03 Mean :12H 39M 18.7757130080936S
## 3rd Qu.:166867508 3rd Qu.:2017-07-07 3rd Qu.:20H 45M 0S
## Max. :238490103 Max. :2021-12-31 Max. :23H 59M 0S
##
## BOROUGH PRECINCT LOCATION
## BRONX :5611 Min. : 1.00 UNKNOWN :11964
## BROOKLYN :8357 1st Qu.: 44.00 MULTI DWELL - PUBLIC HOUS: 3664
## MANHATTAN :2536 Median : 69.00 MULTI DWELL - APT BUILD : 2007
## QUEENS :3031 Mean : 66.32 PVT HOUSE : 641
## STATEN ISLAND: 591 3rd Qu.: 81.00 GROCERY/BODEGA : 468
## Max. :123.00 BAR/NIGHT CLUB : 400
## (Other) : 982
## MURDER_FLAG PERP_AGE PERP_SEX
## false:16604 <18 : 993 F : 212
## true : 3522 18-24 : 4086 M :10469
## 25-44 : 3769 UNKNOWN: 9445
## 45-64 : 374
## 65+ : 44
## UNKNOWN:10860
##
## PERP_RACE VICTIM_AGE VICTIM_SEX
## AMERICAN INDIAN/ALASKAN NATIVE: 2 <18 :1984 F : 1522
## ASIAN / PACIFIC ISLANDER : 88 18-24 :7585 M :18598
## BLACK :7871 25-44 :9122 UNKNOWN: 6
## BLACK HISPANIC : 803 45-64 :1273
## UNKNOWN :9717 65+ : 125
## WHITE : 213 UNKNOWN: 37
## WHITE HISPANIC :1432
## VICTIM_RACE LATITUDE LONGITUDE
## AMERICAN INDIAN/ALASKAN NATIVE: 8 Min. :40.51 Min. :-74.25
## ASIAN / PACIFIC ISLANDER : 260 1st Qu.:40.67 1st Qu.:-73.94
## BLACK :14742 Median :40.70 Median :-73.92
## BLACK HISPANIC : 1854 Mean :40.74 Mean :-73.91
## UNKNOWN : 50 3rd Qu.:40.82 3rd Qu.:-73.88
## WHITE : 517 Max. :40.91 Max. :-73.70
## WHITE HISPANIC : 2695
Changing many of these features from characters into factors really improves R’s ability to summarize the data here. We can make many conclusions about this data set just from looking at the summary.
DATE is slightly right-skewed, with
a mean 9 months in the future of the median. This suggests there were
more incidents from 2006-2014 than 2014-2021.Let’s take a closer look and check the DATE distribution
skewness by finding the midpoint date of the data set (this is the exact
middle date of this data set).
data_interval <- interval(min(import_url$DATE), max(import_url$DATE))
int_start(data_interval) + (int_end(data_interval) - int_start(data_interval)) / 2
## [1] "2013-12-31 12:00:00 UTC"
median(import_url$DATE)
## [1] "2012-10-09"
Interesting, this proves there were more incidents in the first half of the time interval of the data set. We will look further into this during visualizations.
Now that the transformations are complete, let’s start plotting these features against each other and visualizing our data so we can try to make some conclusions and more clearly inform them.
Let’s begin by visualizing the number of shooting incidents by the
BOROUGH they occurred.
# Order factor levels of BOROUGH based on frequency
import_url$BOROUGH <- fct_infreq(import_url$BOROUGH)
# Plot graph
import_url %>%
ggplot(., aes(x = reorder(BOROUGH, BOROUGH, length, decreasing = TRUE), fill = BOROUGH)) +
geom_bar(aes(y = after_stat(count))) +
scale_fill_brewer(palette = 'YlOrRd', direction = -1) +
theme(axis.text.x = element_text(angle = 25, hjust = 1)) +
labs(title = 'Number of Shooting Incidents by Borough',
x = 'Borough', y = 'Number of Incidents',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
As you can see the majority of shooting incidents have occurred in Brooklyn followed by the Bronx and Queens.
Now to plot the number of shooting incidents organized by the
LOCATION description assigned to them.
import_url %>%
ggplot(., aes(x = reorder(LOCATION, LOCATION, length, decreasing = TRUE), fill = LOCATION)) +
geom_bar(aes(y = after_stat(count)), show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
scale_y_log10() +
labs(title = 'Number of Shooting Incidents by Location Description',
x = 'Location Description', y = 'Number of Incidents (Log Scaled)',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
Something to note, this graph has been logarithmically scaled on the
y-axis for easier viewing.
# Group by month and total # of incidents
inc_month_totals <- import_url %>%
group_by(month_totals = months(DATE)) %>%
summarize(n = n()) %>%
arrange(match(month_totals, month.name)) %>%
mutate(month_totals = factor(month_totals, levels = month.name))
# Plot the tibble
month_plot <- ggplot(inc_month_totals, aes(x = month_totals, y = n, fill = month_totals)) +
geom_col()+
scale_fill_brewer(palette = 'Paired', direction = 1, name = 'Month') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = 'Shooting Incidents per Month',
x = 'Month', y = 'Number of Incidents',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
# Group by hour and total # of incidents
inc_hour_totals <- import_url %>%
group_by(hour_totals = hour(TIME)) %>%
summarize(n = n()) %>%
mutate(hour_totals = factor(hour_totals))
# Plot
hour_plot <- ggplot(inc_hour_totals, aes(x = hour_totals, y = n, fill = hour_totals)) +
geom_col(show.legend = FALSE) +
labs(title = 'Shooting Incidents per Hour of the Day',
x = 'Hour (24H)', y = 'Number of Incidents',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
# Plot side-by-side
grid.arrange(month_plot, hour_plot, nrow = 1)
Looking at these makes me ask the question, “have shootings becoming more or less frequent over the years”. Let’s find out what this data suggests the answer is next.
inc_year_totals <- import_url %>%
group_by(year_totals = year(DATE)) %>%
summarize(n = n()) %>%
mutate(year_totals = factor(year_totals))
year_plot <- ggplot(inc_year_totals, aes(x = year_totals, y = n, fill = year_totals)) +
geom_col(show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 25, hjust = 1)) +
labs(title = 'Shooting Incidents per Year',
x = 'Year', y = 'Number of Incidents',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
year_plot
Let’s take a closer look at 2020 and 2021 but include the months to see if the pandemic had any noticeable influence. We’ll also include 2018-2021 for context.
covid_slice <- import_url %>%
filter(DATE > '2018-01-01') %>%
mutate(MONTH = stamp('January', orders = '%B', quiet = TRUE)(DATE)) %>%
mutate(YEAR = stamp('2020', orders = 'y', quiet = TRUE)(DATE))
covid_months_totals <- covid_slice %>%
group_by(MONTH, YEAR) %>%
summarize(n = sum(n())) %>%
arrange(match(MONTH, month.name)) %>%
mutate(MONTH = factor(MONTH, levels = month.name))
## `summarise()` has grouped output by 'MONTH'. You can override using the
## `.groups` argument.
ggplot(covid_months_totals, aes(x = MONTH, y = n, fill = YEAR)) +
geom_col() +
scale_fill_brewer(palette = 'Spectral', direction = 1, name = 'Year') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = 'Stacked Barchart Shooting Incidents per Year 2018-2021',
x = 'Month', y = 'Number of Incidents',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
Interesting, here it shows there was an increase in shooting incidents beginning in May and June of 2020, coming off a downward trend from previous years. Let’s take another view here and include the date when New York City initiated their lockdown to see if it aligns with this increase.
covid_slice_2 <- import_url %>%
filter(DATE > '2018-01-01') %>%
mutate(DATE = stamp('2020-01', orders = 'ym', quiet = TRUE)(DATE))
covid_months_totals_2 <- covid_slice_2 %>%
group_by(DATE) %>%
summarize(n = sum(n()))
ggplot(covid_months_totals_2, aes(x = DATE, y = n, fill = DATE)) +
geom_col(show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
geom_vline(xintercept = 27.25) +
annotate('label', x = 27.25, y = 175, angle = 90, color = 'black', label = 'Covid Pandemic Lockdown Initiated') +
annotate("rect", xmin = 0, xmax = 27.25, ymin = 0, ymax = 250, alpha = .2, fill = "#00c9d0") +
annotate("rect", xmin = 27.25, xmax = 48.5, ymin = 0, ymax = 250, alpha = .2, fill = "#c42a07") +
labs(title = 'Shooting Incidents per Year 2018-2021',
x = 'Month', y = 'Number of Incidents',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
There certainly is some underlying correlation here, but from this data we can’t say for sure what that is exactly. That said, this is an interesting result because the increase in shootings seems to correlate to the period of time where the pandemic lockdown in NYC would have been in affect for over a month and the increase in shootings occurs when restrictions would have been begun to loosen.
For reference here is a map of NYC’s boroughs in the public domain from Wikipedia: https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/5_Boroughs_Labels_New_York_City_Map.svg
reference_map <- 'https://commons.wikimedia.org/w/index.php?title=Special:Redirect/file/5_Boroughs_Labels_New_York_City_Map.svg'
knitr::include_graphics(reference_map)
Labels on borough reference map:
Here we’re going to visualize the location of each shooting incident
using the coordinates given in the dataset. First, we can use the
minimum and maximum values of the longitudes and latitudes to find the
map’s bounding box (edges). Then, use ggmap() to generate a
map centered around these coordinates. Then, we can use
geom_point() and stat_density2d_filled() to
superimpose our data on the map using the same coordinate system we
generated.
# Initialize the bounding box that will contain the map view edges
map_bounds <- c(
left = min(import_url$LONGITUDE),
bottom = min(import_url$LATITUDE),
right = max(import_url$LONGITUDE),
top = max(import_url$LATITUDE))
# Initialize the map of NYC using map_bounds
# Note, there are better maps out there but most require a private google API key,
# which wouldn't work for this public, knit-able, project.
incident_map_point <- ggmap(get_stamenmap(map_bounds, maptype = 'terrain', zoom = 11)) +
# Overlay each data point using LONG. and LAT.
geom_point(data = import_url,
aes(x = LONGITUDE, y = LATITUDE),
color = 'darkred',
size = 0.25,
alpha = 0.5) +
ggtitle('Point Plot of NYPD Shooting Incident Reporting 2006 - 2021') +
labs(x = 'Longitude', y = 'Latitude',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
incident_map_point_color <- ggmap(get_stamenmap(map_bounds, maptype = 'terrain', zoom = 11)) +
# Overlay each data point using LONG. and LAT.
geom_point(data = import_url,
aes(x = LONGITUDE, y = LATITUDE, color = BOROUGH),
size = 0.25,
alpha = 0.5) +
scale_color_brewer(palette = 'Set1', direction = 1, name = 'Borough') +
theme(legend.position = 'bottom') +
guides(color = guide_legend(override.aes = list(size = 5, alpha = 1))) +
ggtitle('Point Plot of NYPD Shooting Incident Reporting 2006 - 2021') +
labs(x = 'Longitude', y = 'Latitude',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
# Display Point Map
grid.arrange(incident_map_point, incident_map_point_color, ncol = 2)
This looks great and gives a much deeper understanding of the spacial
distribution of the shootings, much more than the previous borough
plot.
The point map does have limitations though. There is a loss of information when many points overlap. It’s hard to compare the higher density areas to one another (e.g. we know from the previous graph that Brooklyn has >2000 more incidents than The Bronx, but here you really can’t parse that out). Let’s try to convert this into a heat/density map to get a better picture of these higher density areas.
# Initialize density map to better visualize regions with frequent incidents.
incident_map_density <- ggmap(get_stamenmap(map_bounds, maptype = 'terrain', zoom = 11)) +
stat_density2d_filled(data = import_url,
contour_var = 'density',
aes(x = LONGITUDE, y = LATITUDE, fill = after_stat(level)),
bins = 20,
geom = 'polygon',
alpha = 0.8) +
geom_density_2d(data = import_url,
aes(x = LONGITUDE, y = LATITUDE),
bins = 20,
alpha = 0.2,
color = "white") +
guides(fill = guide_legend(title = "Density")) +
ggtitle('Density Plot of NYPD Shooting Incident Reporting 2006 - 2021') +
labs(x = 'Longitude', y = 'Latitude',
caption = 'Source:<https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic>')
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
# Display Density Map
incident_map_density
Even features that initially seemed sparse, considering the number of missing entries, led to valuable insights into this data.
Some takeaways we’ve seen from above:
Much of the bias of this analysis comes from choosing what features to include or not and possibly what conclusions can be drawn. My experiences while living in two different metropolitan cities for half my life or my past experiences with police officers could also affect the lens I view this information. Also, ever the optimist I assumed shootings would have seen a reduction during the pandemic, but seeing the data now I can see another story. In an attempt to mitigate these biases I’ve chosen to only include what I believe the data to empirically show, and not obfuscate any methods in obtaining these results.
Another bias that is likely present here is how the data was collected, reviewed, and updated. We don’t know the exact method in this case, and it is likely working off of hundreds of police officers’ efforts over the years who went about their work with their own biases. These are all things to keep in mind when working with these datasets and drawing conclusions from them. Hopefully I’ve made the conclusions made here reproducible and clear.